ABSTRACT

The increasing parallelism of many-core systems demands
efficient strategies for run-time system management.
Due to the large number of cores, the management overhead
has a rising impact on the overall system performance. This
work analyzes a clustered infrastructure of dedicated hardware
nodes to manage a homogeneous many-core system.
The hardware nodes implement a message passing protocol
and perform the task mapping and synchronization at run-time.
To make meaningful mapping decisions, the global
management nodes employ a workload status communication
mechanism.

This paper discusses the design-space of the dedicated
infrastructure by means of task mapping use-cases and a parallel
benchmark including application interference. We evaluate
the architecture in terms of application speedup and analyze
the mechanism for the status communication. A comparison
with centralized and fully-distributed configurations
demonstrates the reduction of the computation and
communication management overhead for our approach.

General Terms 

design, architecture 

Keywords 

many-core, embedded system, run-time management,
message passing, task mapping, dedicated hardware

1. INTRODUCTION 

Power efficiency and scalability have been drivers for a
variety of cluster-based many-core systems. Among them,
the P2012 (a.k.a. STHORM) many-core architecture, the
MPPA manycore and the Single-Chip Cloud Computer (SCC)
have recently been implemented as real-world hardware
instances 0 © 11 . Their designs address power budgets
ranging from 2W to 125W and incorporate a multitude of
architectural features and programming models.

The domain of many-cores calls for a sophisticated
(re-)design of the run-time task management. The
task management has to bring the dynamic requirements
of the user applications into accordance with the monitored
state of the chip. In addition, a task manager is responsible for
allocating the resources (1) computation, (2) communication
and (3) memory to the applications. Hardware assistance
has become a key factor in reducing the overhead introduced
by the run-time task manager [19].


The idea of hardware task scheduling can be traced back to
the POLYP mainframe computer [18]. An overview of
separate task synchronization subsystems is given by
Herkersdorf [10]. To the best of our knowledge, however, we are
the first to present and analyze a full-fledged on-chip task
management system built on a dedicated infrastructure of
hardware nodes. The key objective of the dedicated
infrastructure is to conceal the resulting management overhead
from the user tasks.

The remainder of this paper is organized as follows: Section 2
discusses related work. In Section 3 we present our
proposed architecture, and Section 4 introduces the run-time
task manager. Section 5 shows experimental results, and
finally Section 6 concludes the paper.

2. RELATED WORK 

A hardware-assisted run-time software for embedded
many-cores is presented by HARS [17]. However, although it
uses the hardware semaphores included in the STHORM
many-core architecture, its evaluation is limited to
intra-cluster task synchronization.

A distributed run-time is proposed for the MPPA [7]. The
run-time environment exploits a dedicated system core which
acts as a resource manager inside a single cluster. However,
the approach is constrained to a compile-time (static)
mapping scheme.

The SCC comes with a default Linux configuration and the
message passing programming model. In addition, basic
synchronization primitives are implemented in hardware [20]. The
SCC consists of small-size clusters which do not yet contain a
dedicated management core.

Besides clustered solutions, there exist centralized as well as
fully-distributed approaches. Nexus++ uses a single
application-specific circuit to resolve time-critical task
dependencies at run-time [5] and applies a trace-based description of
an H.264 benchmark. A distributed and dedicated hardware
approach has been implemented by Isonet [15]. Isonet
applies a fully-distributed network of dedicated management
nodes for hardware-supported load balancing.

3. SYSTEM ARCHITECTURE 

This paper is a continuation of our work presented in [8] and
analyzes a clustered architecture for the task management.
Our overall system architecture uses a homogeneous
many-core system as a baseline, which is enhanced
by a dedicated management infrastructure. The dedicated
management infrastructure is implemented as a network of
global management nodes and clusters of local controllers.
Each local controller is tightly coupled to a processing
element. The global management nodes are connected to a
global interconnect. A local interconnect links one global
management node with its local controllers. Fig. 1 gives an
outline of the proposed architecture. The interconnects are
implemented as (but not restricted to) shared buses. A
common interconnect between the processing elements is left out
of the figure for better readability. The dedicated nodes
communicate by means of message passing. Each node
contains message queues for transmission and reception.

Figure 1: Outline of the system architecture for the clustered
task management, with k = 4 global management nodes
(GMN) and m = 16 local controllers (LC) coupled to the
processing elements.


3.1 Global Management Nodes 

Each of the global management nodes runs one instance of
the run-time task manager in software and contains
dedicated hardware for message processing. The communication
between the nodes is determined by the message protocol
explained in Sec. 3.3. The system-calls from the user tasks
are executed by the global nodes. Additionally,
they implement a hierarchical task mapping algorithm and
a cluster status communication mechanism at run-time (see
Sec. 4).

The global management nodes require programmability
and a minimal area footprint. Messaging between
the nodes requires fast interrupt handling. We plan to
implement the global management nodes by means of
programmable stack machines. Stack machines have very low
hardware complexity [16], exhibit high performance in
subroutine calls (context switching), and achieve
deterministic time for interrupt handling [13] [12]. Another
advantage is the small code size of programs written for stack
machines [3]. Small memory footprints allow each
global node to have its own program memory, which reduces
the communication overhead for instruction fetching.


3.2 Local Controller 

A local controller (LC) is tightly coupled to a processing
element (PE) for user task execution. The PE contains a
functional model of a RISC-like processor architecture and
executes a trace-based description language. The traces are
used to raise the system-calls given in Tab. 2 and determine
the application behavior (see Sec. 5).

The local controller maintains a system-call dispatcher for
low-latency response and has access to the PE registers. The
dedicated LC can be implemented with low area overhead [9]
and operates in parallel to the PE. Any system-call from a
user task is fetched by the LC and forwarded to its global
node by means of a dedicated message. Due to the dedicated
infrastructure for the task management, the PE executes
only the user tasks.

3.3 Messaging Protocol 

We send messages via the dedicated interconnects for
communication between the hardware nodes. The message
passing combines data transport and run-time system
synchronization. Each message has a header and one or more
32-bit data fields. Table 1 displays the structure of a message.
The header contains the message type, at least the source
address, the priority and a broadcast flag. The size of the
message header depends on the actual hardware
configuration (i.e. number of nodes / address-width).

Table 1: Message structure

type | dst | prio | flag | data


Most of the message types directly correspond to the system
calls given in Tab. 2 and are sent from a local controller to
its global node. Beyond that, a message task-start invokes
the start of a task. That message transports the address of
the task-control-block (see [14]) and the stack-pointer
as message data, and can be sent from a global node to a
local controller as well as to another global node. Further,
the global nodes use the message status-beacon to
broadcast the current workload status (see Sec. 4.2) to all other
global nodes.
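
The message layout of Table 1 and the two special messages can be illustrated with a minimal Python sketch. The field names follow Table 1; the class names, the helper functions, the enum values and the choice of destination 0 for a broadcast are assumptions made for illustration only and do not describe the actual hardware encoding.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class MsgType(Enum):
    """A few message types named in the text (illustrative subset)."""
    RCSV_SPWN = "rcsv-spwn"
    JOIN_EXIT = "join-exit"
    TASK_START = "task-start"
    STATUS_BEACON = "status-beacon"


@dataclass
class Message:
    """Header fields as in Table 1 plus one or more 32-bit data words."""
    type: MsgType
    dst: int
    prio: int
    flag: bool          # broadcast flag
    data: List[int]

    def __post_init__(self) -> None:
        if not self.data:
            raise ValueError("a message carries at least one data field")
        if any(not 0 <= word < 2**32 for word in self.data):
            raise ValueError("data fields are 32-bit words")


def task_start(dst: int, tcb_addr: int, stack_ptr: int, prio: int = 0) -> Message:
    """task-start: carries the task-control-block address and the stack pointer."""
    return Message(MsgType.TASK_START, dst, prio, False, [tcb_addr, stack_ptr])


def status_beacon(workload: int, prio: int = 0) -> Message:
    """status-beacon: broadcast of the local workload (dst is unused here)."""
    return Message(MsgType.STATUS_BEACON, 0, prio, True, [workload])
```

For example, `task_start(3, 0x1000, 0x2000)` yields a non-broadcast message with two data words, whereas `status_beacon(5)` sets the broadcast flag.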

4. TASK MANAGER 

The task manager we use is loosely based on the
Micro-C/OS-II [14] software operating system. We do not adopt
its real-time capabilities but extended the task manager with
basic multi-core functionality. The extensions to the task
manager are explained in Sec. 4.1 and 4.2. We replaced
the task scheduler with a simple first-come-first-serve
strategy.

The system-calls which we apply throughout this paper are
explained in Tab. 2. We use a customized join/barrier
mechanism to synchronize the user tasks. To reduce the number
of system-calls, a child task is allowed to exit immediately
when signaling a join-exit.
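
The counting semantics of the join/barrier calls can be sketched as follows. This is a minimal single-threaded model under our own assumptions: the class and method names are hypothetical, and blocking of the parent is reduced to a query on the counter.

```python
class JoinBarrier:
    """Counter-based join barrier mirroring join-init, join-exit and join-wait."""

    def __init__(self, cnt: int) -> None:
        # join-init: the barrier starts with the number of expected children
        self.count = cnt

    def join_exit(self) -> None:
        # join-exit: decrement the counter; the calling child terminates
        # right away, so no separate exit call is needed
        if self.count <= 0:
            raise RuntimeError("more join-exit calls than expected children")
        self.count -= 1

    def may_resume(self) -> bool:
        # join-wait: the parent may continue once the counter is zero
        return self.count == 0


barrier = JoinBarrier(3)
for _ in range(3):
    barrier.join_exit()
print(barrier.may_resume())
```

Combining the decrement and the task termination in one call is what saves a system-call per child in this model.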

Table 2: System calls

Name        Param.            Description
---------   ---------------   ----------------------------------------
rcsv-spwn   imem, dmem, cnt   Spawn new recursive task of given count
                              (cnt) and instruction (imem) and data
                              (dmem) memory addresses
rcsv-exit   addr              Terminate task
join-init   cnt               Initialize join barrier with given
                              initial count and return its address
                              to user
join-free   addr              Free join barrier from memory
join-wait   addr              Let task wait until counter is zero
join-exit   addr              Decrement counter and terminate task

4.1 Task Mapping

The task mapping algorithm is part of the software OS and
is implemented inside every global node. Since our targeted
task scheduling problems consist of task sets having a large
number of tasks, we use a recursive task spawning/fork
strategy. Every recursive task spawns two additional helper tasks
and then blocks until its children have terminated. The
recursion is executed until one of the following stop conditions
is reached:


1. The number of remaining child tasks is smaller than
or equal to the number of PEs per cluster.

2. The number of active helper tasks is greater than or
equal to the number of clusters.

The recursive start-up follows a dynamic cluster mapping
procedure, which tries to distribute the recursive
helper tasks equally onto the clusters. After the binary fork-tree
has stopped expanding, the actual child tasks of the application
are spawned. This final number of working child tasks is
fixed and determined by the application profile.

The mapping problem is therefore split into two stages: At
the first stage, the mapping algorithm selects the global
nodes (clusters) to which the helper tasks are mapped.
At the second stage, the mapping algorithm
selects the local processing elements to which the actual child
tasks are mapped. Each single mapping decision is made
by means of a min-search: the mapping algorithm chooses
the node with the minimal number of mapped tasks. To
do this, every global node maintains a data structure about
the per-PE workload inside its private cluster and a data
structure about the summarized workload of each remote
global node. In the current implementation, we estimate the
workload by counting the total sum of locally mapped tasks.

Mapping is done only once. We do not allow a task to restart
at a different location (run-time migration), since these
operations usually come at a high performance penalty [2]
and are not in the focus of our analysis.

4.2 Status communication

Communicating the workload status is required to allow
the mapping algorithm to make meaningful decisions.
Due to the shared nature of the global bus interconnect, we
use a broadcast message to inform all collaborating nodes
about the local workload. We implemented a threshold-based
mechanism for broadcasting the total sum of locally
mapped tasks. The mechanism triggers a broadcast every
time the number of mapped tasks has changed by a certain
threshold $\Delta n_{th}$.

5. EVALUATION

We employ a simplified task-based programming model for
our analysis. Parent tasks may spawn numerous child tasks
and wait until their computation has finished. Our main
criterion for evaluation is the throughput time $t_r$ (response
time) of the overall application (parent + children). We
measure the speedup as the ratio of the sequential throughput
time $t_{r,seq}$ to the achievable parallel throughput time $t_{r,par}$
and show that the achievable speedup $S = t_{r,seq}/t_{r,par}$ is
limited by either the computation or the communication
management overhead.

5.1 Analytic Model

Having n independent child tasks of equal length l, m
homogeneous processing elements and k global management nodes,
the maximal achievable speedup is limited by a temporal
management overhead $\Omega(m, n, k)$ as shown in Eqn. 1:


$$ S = \frac{t_{r,seq}}{t_{r,par}} = \frac{n \cdot l}{t_{r,par}} = \frac{n \cdot l}{\lceil n/m \rceil \cdot l + \Omega(m, n, k)} \qquad (1) $$

Due to the run-time computation of the mapping
problem, there is a computation overhead $\Omega_{cmp}$. Having
multiple global nodes k, there is an overhead $\Omega_{msg}$ in
communication. We express the overall management overhead
$\Omega$, depending on the number of processing elements m, the
number of user tasks n and the number of global
management nodes k, by Eqn. 2:

$$ \Omega(m, n, k) = \Omega_{cmp}(m, n, k) + \Omega_{msg}(m, n, k) \qquad (2) $$

Each decision of our task mapping algorithm (see Sec. 4.1)
incurs a selection time overhead $\Omega_s$. Due to the recursive
task startup there is a logarithmic dependency ($\log n$) for
the global mapping stage. The resulting overhead $\Omega_{cmp}$ for
computing the mapping problem of n user tasks is given by
Eqn. 3:

$$ \Omega_{cmp}(m, n, k) = \overbrace{\log(n) \cdot \Omega_s(k)}^{\text{map global}} + \overbrace{\frac{n}{k} \cdot \Omega_s\!\left(\frac{m}{k}\right)}^{\text{map local}} \qquad (3) $$
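
The two terms of Eqn. 3 correspond to the two mapping stages of Sec. 4.1. The following Python sketch illustrates the min-search of both stages under simplifying assumptions: the recursive helper-task start-up is flattened into a single loop over the child tasks, and a binary heap (Python's `heapq`) stands in for any search structure with logarithmic selection time. All function names are hypothetical.

```python
import heapq


def make_heap(num_nodes):
    """Heap of (mapped_task_count, node_id) pairs, all nodes initially idle."""
    heap = [(0, node) for node in range(num_nodes)]
    heapq.heapify(heap)
    return heap


def map_to_min(heap):
    """Min-search: pick the least-loaded node and count one more task on it."""
    count, node = heapq.heappop(heap)
    heapq.heappush(heap, (count + 1, node))
    return node


def map_tasks(n, m, k):
    """Map n child tasks onto m PEs that are grouped into k equal clusters."""
    pes_per_cluster = m // k
    clusters = make_heap(k)                                  # stage 1: global
    local = [make_heap(pes_per_cluster) for _ in range(k)]   # stage 2: local
    placement = []
    for _ in range(n):
        cluster = map_to_min(clusters)       # select the cluster
        pe = map_to_min(local[cluster])      # select the PE inside the cluster
        placement.append((cluster, pe))
    return placement


print(map_tasks(n=8, m=8, k=2))
```

In this sketch every selection searches either k clusters or m/k local PEs, which is why the selection time in Eqn. 3 is evaluated at k and at m/k respectively.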

The required search function for the mapping algorithm can
be implemented with logarithmic time complexity $O(\log v)$
by e.g. Red-Black Trees [4]. The selection time $\Omega_s$ for one
decision of the mapping is modeled as $\Omega_s = c_s \cdot \log v$, where
v is the number of nodes to be searched through and $c_s$ is
a timing parameter of our framework (see Tab. 3).
Correspondingly, the communication overhead due to intra- and
inter-cluster messaging is approximated by means of Eqn. 4:

$$ \Omega_{msg}(m, n, k) = \overbrace{c_b \cdot k}^{\text{global}} + \overbrace{c_b \cdot \frac{n}{k}}^{\text{local}} \qquad (4) $$




















(a) Analytic model for the speedup using the recursive
task startup, with m = 256 PEs and n = 256
child tasks, for a varying number of global nodes k
and coefficient $c_s$

(b) Measured result for the speedup using the recursive
task startup, with m = 256 PEs and n = 256
child tasks, for a varying number of global nodes k
and delay coefficient $c_s$

Figure 2: Independent tasks on 256 homogeneous processing elements


Table 3: Default parameters for the analytic evaluation and
the transaction level simulations

Name                                    Value
-------------------------------------   -----------
Number of processing elements           256
Global bus width                        32 bit
Local bus width                         32 bit
Message receive delay ($c_b/2$)         4 Ticks
Message transmit delay ($c_b/2$)        4 Ticks
Selection delay coefficient ($c_s$)     8 Ticks
Max. child task length                  16000 Ticks
Simulation length                       1e7 Ticks


Eqn. 4 introduces the timing parameter $c_b$ to model the
time delay incurred by communication messages. In Fig. 2a
the projected speedup is plotted for the analytic model. We
set m = 256 PEs and n = 256 child tasks while varying the
number of global nodes k and the coefficient $c_s$. As indicated,
the recursive startup and task mapping favor a number of
32 to 64 global management nodes.
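
The analytic model can be evaluated numerically with a few lines of Python. This is a sketch under stated assumptions rather than the evaluation code behind Fig. 2a: logarithms are taken to base 2, the local messaging term is taken as $c_b \cdot n/k$, the task length is set to the maximum child task length of Tab. 3, and $c_b = 8$ Ticks is the sum of the transmit and receive delays.

```python
import math

C_S = 8           # selection delay coefficient in Ticks (Tab. 3)
C_B = 8           # message delay in Ticks: 4 transmit + 4 receive (Tab. 3)
TASK_LEN = 16000  # child task length l in Ticks (maximum value of Tab. 3)


def omega_s(v, c_s=C_S):
    """Selection time for one min-search over v nodes (base-2 log assumed)."""
    return c_s * math.log2(v)


def omega_cmp(m, n, k, c_s=C_S):
    """Computation overhead: global mapping stage plus local mapping stage."""
    return math.log2(n) * omega_s(k, c_s) + (n / k) * omega_s(m / k, c_s)


def omega_msg(m, n, k, c_b=C_B):
    """Communication overhead: global plus local messaging (n/k assumed)."""
    return c_b * k + c_b * n / k


def omega(m, n, k):
    """Overall management overhead as the sum of both contributions."""
    return omega_cmp(m, n, k) + omega_msg(m, n, k)


def speedup(m, n, k, l=TASK_LEN):
    """Speedup of n equal tasks on m PEs with k global management nodes."""
    return n * l / (math.ceil(n / m) * l + omega(m, n, k))


for k in (1, 8, 16, 32, 64, 128, 256):
    print(k, omega(256, 256, k), round(speedup(256, 256, k), 1))
```

Under these assumptions the overhead for m = n = 256 is smallest at k = 32 among the powers of two, with k = 64 and k = 16 close behind, which agrees with the range of 32 to 64 nodes stated above.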

5.2 Experimental Setup 

We use the transaction-level simulator presented in [8] to
evaluate our architecture and to compare the analytic model
against the simulation result. Table 3 gives the default
parameters for our model. Our evaluation ignores wire
capacitances, which in effect favors fully-centralized or
fully-distributed configurations with a large number of nodes
attached to the local or global interconnects. To eliminate the
effect of bottlenecks at the interconnects, we analyzed the
bit-width of the buses beforehand and set it to a convenient
value of 32 bit.

5.3 Independent Tasks 

The benchmarks are modeled by means of a trace description
language. The traces describe the computation and
memory access patterns of the tasks as well as the calls to the
run-time services (system calls). The traces are interpreted
and executed by the model for the processing elements. In
our current analysis we use a synthetic parallel benchmark
consisting of n independent tasks without any memory
access. Fig. 2b shows that the measured speedup fits the
analytic description quite well, due to the regular nature of the
benchmark.

Figure 4: Periodic start-up sequence for two competing
applications with inter-arrival time A and n child tasks


5.4 Application Interference 

In the second experiment we included interference between
two competing applications having equal priority. The
application start-up sequence with inter-arrival time A as shown
in Fig. 4 is repeated periodically. The inter-arrival time A is
Poisson distributed and has a mean value of A = 7999 Ticks.
The number of processing elements is m = 256 and each
application has n = 100 child tasks. The child task length is
uniformly distributed between 95 % and 100 % of the maximum
computation time. The synchronization between the
parent and the child tasks is done by means of the fork/join
mechanism presented in Sec. 4. The stimulus is active for
90 % of the simulation time and is sent with highest
priority directly to a randomly chosen global node. The other
global nodes are kept agnostic about arriving applications
and must update their information according to the
presented status communication (see Sec. 4.2). We do not
display any values where the number of completed
applications differs from the number of injected ones (no misses are
allowed).

(a) Evaluation of the speedup versus the threshold $\Delta n_{th}$ for
the threshold-based workload status communication
mechanism with different numbers of global nodes

(b) Total number of transmitted beacons for workload
status communication versus the threshold $\Delta n_{th}$ with
different numbers of global nodes

Figure 3: Application interference on 256 homogeneous processing elements


Fig. 3a shows the resulting application speedup for the
hierarchical task mapping algorithm (see Sec. 4) and the
threshold-based status communication mechanism. Using
k = 16 global hardware nodes and a threshold $\Delta n_{th} = 4$, a
speedup improvement by a factor of 2.8 compared to k = 1
is achieved. For a fully-distributed configuration the
improvement factor is only around 1.6 compared to k = 1.
For the given benchmark, the threshold-based mechanism
provides robust load balancing as long as the threshold is
smaller than the number of processing elements per cluster.

Further, we display the number of transmitted status
beacons for the threshold-based mechanism in Fig. 3b. The
figure gives an indication of the energy required for status
communication, which is related to the number of received
beacons. Every transmitted beacon must be received by all
remote nodes to fully synchronize the network. For a
threshold of $\Delta n_{th} = 4$, a fine-grained clustered
configuration with k = 32 management nodes must transmit
a number of beacons that is higher by a factor of around 1.37
than for a configuration having k = 16 nodes.
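
As a rough reading of these numbers, assume that every beacon is received by each of the k - 1 remote nodes and that the energy scales with the number of receptions. With B(k) denoting the number of transmitted beacons and R(k) the number of receptions (both symbols are introduced here for illustration only), the relation can be written as:

```latex
R(k) = B(k)\,(k - 1), \qquad
\frac{R(32)}{R(16)} = \frac{B(32)}{B(16)} \cdot \frac{31}{15}
\approx 1.37 \cdot 2.07 \approx 2.83
```

Under this assumption the reception count grows considerably faster than the transmission count when moving from k = 16 to k = 32.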

For a preliminary area analysis we compare an in-house
implementation of a dedicated 32-bit stack machine as global
management node (GMN) to an mLite/PLASMA CPU [21]
as processing element. Both designs have been synthesized
using an industrial 65nm low-power technology (see Tab. 4).
Even when disregarding an additional multiplier of 3547 µm²
shipped inside the mLite, we can still report around 25% less
area for the stack machine.


Table 4: Synthesis results for 65nm low-power

Unit    Comb. [µm²]   Non-comb. [µm²]   T_clk
-----   -----------   ---------------   -------
GMN     9290.4        9881.2            1.77 ns
mLite   16268.4       12909.5           1.79 ns
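
The area saving quoted above can be reproduced from Table 4 with a short calculation. The sketch assumes that the total area is the sum of the combinational and non-combinational parts and that the multiplier area is subtracted from the mLite total; the variable names are ours.

```python
# Areas in square micrometres, taken from Table 4
gmn_area = 9290.4 + 9881.2          # stack machine (GMN)
mlite_area = 16268.4 + 12909.5      # mLite/PLASMA CPU
multiplier_area = 3547.0            # multiplier shipped inside the mLite

# Compare against the mLite without its multiplier (assumed subtraction)
mlite_without_mult = mlite_area - multiplier_area
saving = 1.0 - gmn_area / mlite_without_mult

print(f"GMN: {gmn_area:.1f}, mLite w/o multiplier: {mlite_without_mult:.1f}")
print(f"relative saving: {saving:.1%}")
```

The resulting saving lies slightly above 25 %, in line with the statement in the text.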


5.5 Summary 

In Tab. 5 we summarize the results of our evaluation in
terms of application speedup using the presented hardware
infrastructure. As a comparison we give our obtained values
for a fully-centralized configuration (like e.g. Nexus++ [5])
and a fully-distributed one (like e.g. Isonet [15]). The table
indicates the significant impact of the management
overhead, which was modeled in Sec. 5.1. As
future work, we plan to consider a cycle-accurate model of
the task manager and analyze the overall power
consumption of the system. To obtain a more realistic scenario for the
user applications, their memory accesses will be considered as
well.

Table 5: Comparison of speedup ($S = t_{seq}/t_{par}$) for n = 100
independent tasks on m = 256 PEs using different numbers
of clusters (global nodes) k.

k     Speedup   Ref.
---   -------   ----------------------------------
1     28.1      Centralized, like e.g. Nexus++ [5]
8     73.5      this work
16    78.7      this work
256   44.3      Distributed, like e.g. Isonet [15]
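
The improvement factors quoted in Sec. 5.4 follow directly from the speedup values of Table 5, taking the centralized configuration (k = 1) as the reference:

```latex
\frac{S(k=16)}{S(k=1)} = \frac{78.7}{28.1} \approx 2.80, \qquad
\frac{S(k=256)}{S(k=1)} = \frac{44.3}{28.1} \approx 1.58
```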

6. CONCLUSION 

A dedicated infrastructure of hardware nodes for run-time
task management has been introduced. Compared to
previous works we consider a full-fledged and separated task
management infrastructure. The infrastructure uses a
message passing protocol and allows a design trade-off between
the advantages of centralized and fully-distributed
architectures by choosing an optimal cluster size.

We analyze the clustered architecture by means of an
analytic description as well as by transaction level simulations
using a parallel benchmark including application
interference. Our simulations revealed a significant impact of the
management overhead on the overall system performance.

The management overhead for the task mapping problem
can be reduced by using our infrastructure and a two-stage
task mapping approach. Having m = 256 processing
elements, choosing the optimal cluster size can provide a
performance improvement by a factor of 2.8 compared to a
single-cluster/centralized configuration.

The results further show the dependency of the run-time
management system on the status information from remote
clusters. The lack of information may lead to inappropriate
mapping decisions causing a performance drawback. We
measured the communication overhead by counting the
number of status beacons transmitted by the global management
nodes. Using a threshold-based mechanism for status
communication and the optimal cluster size, we measured a
significant reduction in terms of transmitted synchronization
messages compared to more fine-grained clustered
configurations.